SemanticScuttle - klotz.me » Tags: machine learning+data science

The Pearson Correlation Coefficient, Explained Simply

A simple explanation of the Pearson correlation coefficient with examples
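
As a minimal sketch of what the coefficient measures, the snippet below computes Pearson's r from its definition (covariance divided by the product of the standard deviations). The function name `pearson_r` and the sample data are illustrative assumptions, not taken from the linked article.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations (numerator) and sums of squared deviations (denominator).
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when one sequence is constant")
    return co_dev / math.sqrt(ss_x * ss_y)


# Hypothetical data: a perfectly linear increasing and decreasing relationship.
hours = [1, 2, 3, 4, 5]
r_up = pearson_r(hours, [2, 4, 6, 8, 10])
r_down = pearson_r(hours, [10, 8, 6, 4, 2])
print(r_up, r_down)
```

A value near 1 or -1 indicates a strong linear relationship; a value near 0 indicates no linear relationship, although a nonlinear one may still exist.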

2025-11-03 Tags: statistics, data science, machine learning, python, pearson correlation, regression by klotz

Prompt Engineering for Time-Series Analysis with Large Language Models

This article explores how prompt engineering can be used to improve time-series analysis with Large Language Models (LLMs), covering core strategies, preprocessing, anomaly detection, and feature engineering. It provides practical prompts and examples for various tasks.

2025-10-16 Tags: llm, prompt engineering, time series, forecasting, anomaly detection, feature engineering, data science, machine learning, production engineering, observability by klotz

I Was Wrong: Start Simple, Then Move to More Complex

The author discusses a shift in approach to clustering mixed data, advocating for starting with the simpler Gower distance metric before resorting to more complex embedding techniques like UMAP. They introduce 'Gower Express', an optimized and accelerated implementation of Gower.
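
To illustrate why Gower distance suits mixed data, here is a minimal sketch of the textbook definition: numeric features contribute a range-normalized absolute difference, categorical features contribute 0 or 1, and the result is the average. This is not the 'Gower Express' API; the function name, record layout, and feature ranges are assumptions made for the example.

```python
def gower_distance(a, b, numeric_ranges):
    """Gower distance between two records given as dicts.

    numeric_ranges maps each numeric feature name to its range (max - min)
    over the dataset; every other feature is treated as categorical.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            feature_range = numeric_ranges[key]
            # Range-normalized absolute difference, always in [0, 1].
            total += abs(a[key] - b[key]) / feature_range if feature_range else 0.0
        else:
            # Categorical mismatch counts as 1, match as 0.
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


# Hypothetical records and dataset-wide ranges.
ranges = {"age": 40, "income": 50_000}
alice = {"age": 30, "income": 40_000, "city": "Paris"}
bob = {"age": 50, "income": 40_000, "city": "Rome"}

d = gower_distance(alice, bob, ranges)
print(d)  # age contributes 0.5, income 0.0, city 1.0, averaged over 3 features
```

A pairwise matrix of such distances can then be handed to any clustering method that accepts precomputed distances.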

2025-09-05 Tags: clustering, data science, machine learning, gower distance, umap, gower express, mixed data, python, scikit-learn, data analysis, shrunk by klotz

A Visual Guide to Tuning Random Forest Hyperparameters

This article explores the impact of hyperparameters on random forests, both in terms of performance and visual representation. It compares the performance of a default random forest with tuned decision trees and examines the effects of various hyperparameters like `n_estimators`, `max_depth`, and `ccp_alpha` using visualizations of individual trees, predictions, and errors.
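
A rough sketch of how such a comparison could be set up with scikit-learn follows. The synthetic dataset, the specific hyperparameter values, and the variable names are assumptions for illustration and are not the article's own experiment.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption; the article's dataset is not given here).
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# One forest per setting; each entry overrides a single hyperparameter.
settings = {
    "default": {},
    "few_trees": {"n_estimators": 5},
    "shallow": {"max_depth": 2},
    "pruned": {"ccp_alpha": 50.0},
}

results = {}
for name, params in settings.items():
    forest = RandomForestRegressor(random_state=0, **params)
    forest.fit(X_train, y_train)
    results[name] = {
        "r2": forest.score(X_test, y_test),  # R^2 on held-out data
        "n_trees": len(forest.estimators_),
        "max_tree_depth": max(tree.get_depth() for tree in forest.estimators_),
    }

for name, info in results.items():
    print(f"{name:>10}: R^2={info['r2']:.3f} "
          f"trees={info['n_trees']} depth={info['max_tree_depth']}")
```

Recording tree count and depth next to the score makes it easier to see which structural change each hyperparameter causes.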

2025-09-05 Tags: data science, machine learning, random forests, hyperparameter tuning, python, data visualization, scikit-learn, decision trees, james gibbins by klotz

Using Google’s LangExtract and Gemma for Structured Data Extraction

How to extract structured information effectively and accurately from long unstructured text with LangExtract and LLMs. This article explores Google’s LangExtract framework together with Google’s open-source LLM, Gemma 3, demonstrating how to parse an insurance policy to surface details like exclusions.

2025-08-27 Tags: data science, large language models, llm, machine learning, structured data, langextract, gemma, data extraction by klotz

Exploring NotebookLM Alternatives

This article explores alternatives to NotebookLM, a Google assistant for synthesizing information from documents. It details NousWise, ElevenLabs, NoteGPT, Notion, Evernote, and Obsidian, outlining their key features, limitations, and considerations for choosing the right tool.

2025-08-06 Tags: notebooklm, llm, alternatives, nouswise, elevenlabs, notegpt, notion, evernote, obsidian, data science, machine learning, productivity by klotz

Accuracy Is Dead: Calibration, Discrimination, and Other Metrics You Actually Need

A deep dive into advanced evaluation for data scientists, discussing why accuracy is often misleading and exploring alternative metrics for classification and regression tasks like ROC-AUC, Log Loss, R², RMSLE, and Quantile Loss.
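
The sketch below shows, under illustrative assumptions, why accuracy can hide differences that other metrics expose: two hypothetical classifiers classify every example correctly, yet their log loss differs because one is far less confident. It also includes small reference implementations of RMSLE and quantile (pinball) loss. All function names and data are made up for the example.

```python
import math


def accuracy(y_true, probs, threshold=0.5):
    """Share of examples whose thresholded probability matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, probs))
    return hits / len(y_true)


def log_loss(y_true, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative targets."""
    squared = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def quantile_loss(y_true, y_pred, q):
    """Pinball loss: under-prediction costs q, over-prediction costs 1 - q."""
    losses = [max(q * (t - p), (q - 1) * (t - p)) for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)


labels = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.9, 0.1]  # hypothetical well-separated scores
hesitant = [0.6, 0.4, 0.6, 0.4]   # same decisions, weaker probabilities

print(accuracy(labels, confident), accuracy(labels, hesitant))
print(log_loss(labels, confident), log_loss(labels, hesitant))
print(quantile_loss([10, 10], [8, 12], q=0.9))
```

With `q=0.9`, under-predicting by 2 costs nine times as much as over-predicting by 2, which is the asymmetry quantile loss is designed to express.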

2025-07-18 Tags: data science, calibration, discrimination, roc-auc, log loss, regression, classification by klotz

AI Nexus

AI Nexus is a platform for collaboration, knowledge exchange, and groundbreaking discourse in AI. It features upcoming AI events, speaker series, and faculty contributions to the global AI community. The site also provides information on MBZUAI programs and opportunities for collaboration.

2025-05-21 Tags: ai, events, mbzuai, machine learning, digital public health, data science, elizabeth churchill by klotz

Accelerate Deep Learning and LLM Inference with Apache Spark in the Cloud

This article details how to accelerate deep learning and LLM inference using Apache Spark, focusing on distributed inference strategies. It covers basic deployment with `predict_batch_udf`, advanced deployment with inference servers like NVIDIA Triton and vLLM, and deployment on cloud platforms like Databricks and Dataproc. It also provides guidance on resource management and configuration for optimal performance.

2025-05-09 Tags: data science, deep learning, llm, apache spark, nvidia, rapids, triton, vllm, databricks, dataproc, mlops by klotz

10 Python One-Liners for Machine Learning Modeling

The article showcases concise Python code snippets (one-liners) for common machine learning tasks like data splitting, standardization, model training (linear regression, logistic regression, decision tree, random forest), and prediction, leveraging libraries such as scikit-learn.

| **#** | **One-Liner** | **Description** | **Library** | **Use Case** |
|-----|-----------------------------------------------------|-------------------------------------------------------------------------------------|-------------------|-------------------------------------------------|
| 1 | `from sklearn.datasets import load_iris; X, y = load_iris(return_X_y=True)` | Loads the Iris dataset, a classic for classification. | scikit-learn | Loading a standard dataset. |
| 2 | `from sklearn.model_selection import train_test_split; X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)` | Splits the dataset into training and testing sets. | scikit-learn | Preparing data for model training & evaluation.|
| 3 | `from sklearn.linear_model import LogisticRegression; model = LogisticRegression(random_state=1)` | Creates a Logistic Regression model. | scikit-learn | Classification (Iris has three classes). |
| 4 | `model.fit(X_train, y_train)` | Trains the Logistic Regression model. | scikit-learn | Model training. |
| 5 | `y_pred = model.predict(X_test)` | Predicts labels for the test dataset. | scikit-learn | Making predictions. |
| 6 | `from sklearn.metrics import accuracy_score; accuracy = accuracy_score(y_test, y_pred)` | Calculates the accuracy of the model. | scikit-learn | Evaluating model performance. |
| 7 | `import pandas as pd; df = pd.DataFrame(X, columns=load_iris().feature_names)` | Creates a Pandas DataFrame from the Iris dataset features. | Pandas | Data manipulation and analysis. |
| 8 | `df['target'] = y` | Adds the target variable to the DataFrame. | Pandas | Combining features and labels. |
| 9 | `df.head()` | Displays the first few rows of the DataFrame. | Pandas | Inspecting the data. |
| 10 | `df.describe()` | Generates descriptive statistics of the DataFrame. | Pandas | Understanding data distribution. |
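
Read top to bottom, the rows above form one small workflow. The sketch below strings them into a single script, assuming scikit-learn and pandas are installed. Two details are assumptions added here rather than part of the table: the dataset object is kept in a variable named `iris` so its feature names can be reused, and `max_iter=1000` is passed to reduce the chance of a convergence warning.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Rows 1-2: load the data and hold out 30% for testing.
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Rows 3-6: create, train, predict, and evaluate.
model = LogisticRegression(random_state=1, max_iter=1000)  # max_iter is an added assumption
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.3f}")

# Rows 7-10: build a DataFrame for inspection.
df = pd.DataFrame(X, columns=iris.feature_names)
df["target"] = y
print(df.head())
print(df.describe())
```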

2025-04-26 Tags: python, machine learning, one-liner, scikit-learn, linear regression, logistic regression, decision tree, random forest, data science, modeling by klotz
